Abstract

Tachyon is a distributed file system enabling reliable data sharing
at memory speed across cluster computing frameworks. While caching
today improves read workloads, writes are either network or disk
bound, as replication is used for fault-tolerance. Tachyon eliminates
this bottleneck by pushing lineage, a well-known technique borrowed
from application frameworks, into the storage layer. The key
challenge in making a long-lived lineage-based storage system is
timely data recovery in case of failures. Tachyon addresses this
issue by introducing a checkpointing algorithm that guarantees
bounded recovery cost and resource allocation strategies for
recomputation under common resource schedulers. Our evaluation shows
that Tachyon outperforms in-memory HDFS by 110x for writes. It also
improves the end-to-end latency of a realistic workflow by 4x.
Tachyon is open source and is deployed at multiple companies.

1  Introduction 

Over the past few years, there have been tremendous efforts to
improve the speed and sophistication of large-scale data-parallel
processing systems. Practitioners and researchers have built a wide
array of programming frameworks [29, 30, 31, 37, 46, 47] and storage
systems [13, 14, 22, 23, 34] tailored to a variety of workloads.

As the performance of many of these systems is I/O bound, the
traditional means of improving their speed is to cache data in
memory [8, 11]. While caching can dramatically improve read
performance, unfortunately, it does not help much with write
performance. This is because these highly parallel systems need to
provide fault-tolerance, and the way they achieve it is by
replicating the data written across nodes. Even replicating the data
in memory can lead to a significant drop in write performance, as
both the latency and throughput of the network are typically much
worse than those of local memory.

Slow writes can significantly hurt the performance of job pipelines,
where one job consumes the output of another. These pipelines are
regularly produced by workflow managers such as Oozie [6] and
Luigi [9], e.g., to perform data extraction with MapReduce, then
execute a SQL query, then run a machine learning algorithm on the
query’s result. Furthermore, many high-level programming
interfaces [2, 5, 40], such as Pig [33] and FlumeJava [16], compile
programs into multiple MapReduce jobs that run sequentially. In all
these cases, data is replicated across the network between each of
the steps.

To improve write performance, we present Tachyon, an in-memory
storage system that achieves high-throughput writes and reads,
without compromising fault-tolerance. Tachyon circumvents the
throughput limitations of replication by leveraging the concept of
lineage, where a lost output is recovered by re-executing the
operations (tasks) that created the output. As a result, lineage
provides fault-tolerance without the need for replicating the data.

While the concept of lineage has been used before in the context of
computing frameworks like Spark and Nectar [24, 46], Tachyon is the
first system to push lineage into the storage layer for performance
gains. This raises several new challenges that do not exist in
previous systems, which have so far focused on recomputing the lost
outputs within a single job and/or a single computing framework.

The first challenge is bounding the recomputation cost for a
long-running storage system. This challenge does not exist for a
single job, such as a MapReduce or Spark job, as in this case, the
recomputation time is trivially bounded by the job’s execution time.
In contrast, Tachyon runs indefinitely, which means that the
recomputation time can be unbounded. Previous frameworks that support
long-running jobs, such as Spark Streaming [47], circumvent this
challenge by using periodic checkpointing. However, in doing so, they
leverage the semantics of their programming model to decide when and
what to checkpoint. Unfortunately, using the same techniques in
Tachyon is difficult, as the storage layer is agnostic to the
semantics of the jobs running on the data (e.g., when outputs will be
reused), and job execution characteristics can vary widely.

The second challenge is how to allocate resources for
recomputations. For example, if jobs have priorities, Tachyon must,
on the one hand, make sure that recomputation tasks get adequate
resources (even if the cluster is fully utilized), and on the other
hand, Tachyon must ensure that recomputation tasks do not severely
impact the performance of currently running jobs with possibly
higher priorities.

Tachyon bounds data recomputation cost, thus addressing the first
challenge, by continuously checkpointing files asynchronously in the
background. To this end, we propose a novel algorithm, called the
Edge algorithm, that requires no knowledge of the job’s semantics and
provides an upper bound on the recomputation cost regardless of the
access pattern of the workload.

To address the second challenge, Tachyon provides resource
allocation schemes that respect job priorities under two common
cluster allocation models: strict priority and weighted fair
sharing [27, 45]. For example, in a cluster using a strict priority
scheduler, if a missing input is requested by a low priority job, the
recomputation minimizes its impact on high priority jobs. However, if
the same input is later requested by a higher priority job, Tachyon
automatically increases the amount of resources allocated for
recomputation to avoid priority inversion [28].

We have implemented Tachyon with a general lineage-specification API
that can capture computations in many of today’s popular
data-parallel computing models, e.g., MapReduce and SQL. We also
ported the Hadoop and Spark frameworks to run on top of it. The
project is open source, has more than 40 contributors from over 10
institutions, and is deployed at multiple companies.

Our evaluation shows that on average, Tachyon¹ achieves 110x higher
write throughput than in-memory HDFS [3]. In a realistic industry
workflow, Tachyon improves end-to-end latency by 4x compared to
in-memory HDFS. In addition, because many files in computing clusters
are temporary files that get deleted before they are checkpointed,
Tachyon can reduce replication-caused network traffic by up to 50%.
Finally, based on traces from Facebook and Bing, Tachyon would
consume no more than 1.6% of cluster resources for recomputation.

More importantly, due to the inherent bandwidth limitations of
replication, a lineage-based recovery model might be the only way to
make cluster storage systems match the speed of in-memory
computations in the future. This work aims to address some of the
leading challenges in making such a system possible.

¹ This paper focuses on in-memory Tachyon deployment. However,
Tachyon can also speed up SSD- and disk-based systems if the
aggregate local I/O bandwidth is higher than the network bandwidth.

Media       Capacity      Bandwidth
HDD (x12)   12-36 TB      0.2-2 GB/sec
SSD (x4)    1-4 TB        1-4 GB/sec
Network     N/A           1.25 GB/sec
Memory      128-512 GB    10-100 GB/sec

Table 1: Typical datacenter node setting [7].

2  Background 

This  section  describes  our  target  workload  and  provides 
background  on  existing  solutions  and  the  lineage  concept. 
Section  8  describes  related  work  in  greater  detail. 

2.1  Target  Workload 

We  have  designed  Tachyon  for  a  target  environment  based 
on  today’s  big  data  workloads: 

• Immutable data: Data is immutable once written, since dominant
underlying storage systems, such as HDFS [3], only support the
append operation.

• Deterministic jobs: Many frameworks, such as MapReduce [20] and
Spark [46], use recomputation for fault tolerance within a job and
require user code to be deterministic. We provide lineage-based
recovery under the same assumption. Nondeterministic frameworks can
still store data in Tachyon using replication.

• Locality-based scheduling: Many computing frameworks [20, 46]
schedule jobs based on locality to minimize network transfers, so
reads can be data-local.

• Program size vs. data size: In big data processing, the same
operation is repeatedly applied on massive data. Therefore,
replicating programs is much less expensive than replicating data.

• All data vs. working set: Even though the whole data set is large
and has to be stored on disks, the working set of many applications
fits in memory [11, 46].

2.2  Existing  Solutions 

In-memory computation frameworks, such as Spark and Piccolo [37], as
well as caching in storage systems, have greatly sped up individual
jobs. However, sharing (writing) data reliably among different jobs
often becomes a bottleneck.

The write throughput is limited by disk (or SSD) and network
bandwidths in existing storage solutions, such as HDFS [3],
FDS [13], Cassandra [1], HBase [4], and RAMCloud [34]. All these
systems use media with much lower bandwidth than memory (Table 1).
The fundamental issue is that in order to be fault-tolerant, these
systems replicate data across the network and write at least one
copy onto non-volatile media to allow writes to survive
datacenter-wide failures, such as power outages. Because of these
limitations and the advancement of in-memory computation
frameworks [29, 30, 37, 46], inter-job data sharing cost often
dominates pipelines’ end-to-end latencies for big data workloads.
While some jobs’ outputs are much smaller than their inputs, a recent
trace from Cloudera showed that, on average, 34% of jobs (weighted by
execution time) across five customers had outputs that were at least
as large as their inputs [17]. In an in-memory computing cluster,
these jobs would be write throughput bound.

Hardware advancement is unlikely to solve the issue. Memory
bandwidth is one to three orders of magnitude higher than the
aggregate disk bandwidth on a node. The bandwidth gap between memory
and disk is widening because memory and disk bandwidth are improving
at different rates. The emergence of SSDs has little impact on this,
since their major advantage over disks is random access latency, not
sequential I/O bandwidth, which is what most data-intensive workloads
need. Furthermore, throughput increases in networks indicate that
over-the-network memory replication might be feasible. However,
surviving datacenter power outages requires at least one disk copy
for the system to be fault-tolerant. Hence, in order to provide high
throughput, storage systems have to achieve fault-tolerance without
replication.
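To make the size of the gap concrete, the short sketch below computes how much faster memory is than each other medium, using only the bandwidth ranges listed in Table 1. The dictionary keys and the function name are illustrative choices, not part of Tachyon.

```python
# Bandwidth ranges in GB/sec, copied from Table 1 (low end, high end).
BANDWIDTH_GB_PER_SEC = {
    "hdd_x12": (0.2, 2.0),
    "ssd_x4": (1.0, 4.0),
    "network": (1.25, 1.25),
    "memory": (10.0, 100.0),
}


def memory_speedup(medium: str) -> tuple:
    """Return (smallest, largest) ratio of memory bandwidth to `medium`."""
    mem_lo, mem_hi = BANDWIDTH_GB_PER_SEC["memory"]
    med_lo, med_hi = BANDWIDTH_GB_PER_SEC[medium]
    # Smallest gap: slowest memory vs. fastest medium, and vice versa.
    return mem_lo / med_hi, mem_hi / med_lo


for medium in ("hdd_x12", "ssd_x4", "network"):
    low, high = memory_speedup(medium)
    print(f"{medium}: memory bandwidth is {low:g}x to {high:g}x higher")
```

Under these figures, a write that must cross the network before it is acknowledged runs at a fraction of local memory bandwidth, which is the bottleneck the rest of the paper targets.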

2.3  Lineage 

Lineage has been used in various areas, such as scientific
computing [15] and databases [18]. Applications include confidence
computation, view maintenance, and data quality control.

Recently, the concept has been successfully applied in several
computation frameworks, e.g., Spark, MapReduce, and Dryad. These
frameworks track data dependencies within a job, and recompute when
a task fails. However, when different jobs, possibly written in
different frameworks, share data, the data needs to be written to a
storage system. Nectar [24] also uses lineage for a specific
framework (DryadLINQ) with the goal of saving space and avoiding
recomputation of results that have already been computed by previous
queries.

Due to the characteristics outlined in Section 2.1, we see the use
of lineage as an exciting opportunity for providing similar
recovery, not just within jobs/frameworks, but also across them,
through a distributed storage system. However, recomputation-based
recovery comes with a set of challenges when applied at the storage
system level, which the remainder of this paper is devoted to
addressing.

Figure 1: Tachyon Architecture.

3  Design  Overview 

This  section  overviews  the  design  of  Tachyon,  while  the 
following  two  sections  (§4  &  §5)  focus  on  the  two  main 
challenges  that  a  storage  system  incorporating  lineage 
faces:  bounding  recovery  cost  and  allocating  resources  for 
recomputation. 

3.1  System  Architecture 

Tachyon consists of two layers: lineage and persistence. The lineage
layer tracks the sequence of jobs that have created a particular
data output. The persistence layer persists data onto storage and is
mainly used for asynchronous checkpoints. Since its details are
similar to those of many other storage systems, we focus in this
paper on asynchronous checkpointing (Section 4).

Tachyon employs a standard master-slave architecture similar to HDFS
and GFS (see Figure 1). In the remainder of this section we discuss
the unique aspects of Tachyon.

In addition to managing metadata, the master also contains a
workflow manager. The role of this manager is to track lineage
information, compute checkpoint order (§4), and interact with a
cluster resource manager to allocate resources for
recomputation (§5).

Each worker runs a daemon that manages local resources, and
periodically reports the status to the master. In addition, each
worker uses a RAMdisk for storing memory-mapped files. A user
application can bypass the daemon and read directly from RAMdisk.
This way, an application colocated with data will read the data at
memory speeds, while avoiding any extra data copying.

3.2  An  Example 

To illustrate how Tachyon works, consider the following example.
Assume job P reads file set A and writes file set B. Before P
produces the output, it submits its lineage information L to
Tachyon. This information describes how to run P (e.g., command line
arguments, configuration parameters) to generate B from A. Tachyon
records L reliably using the persistence layer. L guarantees that if
B is lost, Tachyon can recompute it by (partially) re-executing P.
As a result, leveraging the lineage, P can write a single copy of B
to memory without compromising fault-tolerance. Figure 2 shows a
more complex lineage example.

Figure 2: Multiple frameworks lineage graph example.

Return                     Signature
Global Unique Lineage Id   createDependency(inputFiles, outputFiles,
                           binaryPrograms, executionConfiguration,
                           dependencyType)
Dependency Info            getDependency(lineageId)

Table 2: Submit and Retrieve Lineage API
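The recovery path for job P can be sketched in a few lines. This is a toy, single-process model rather than Tachyon's implementation: the `Lineage` and `LineageStore` names are hypothetical, a job is represented by a deterministic Python function instead of a binary plus configuration, and durable recording of L is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Lineage:
    """Hypothetical lineage record: how to rebuild `outputs` from `inputs`."""
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    # Stand-in for "binary + configuration": must be deterministic.
    program: Callable[[Dict[str, bytes]], Dict[str, bytes]]


class LineageStore:
    def __init__(self) -> None:
        self.memory: Dict[str, bytes] = {}      # the single in-memory copy
        self.producer: Dict[str, Lineage] = {}  # file name -> its lineage

    def submit(self, lineage: Lineage) -> None:
        # Called before the job writes its output.
        for name in lineage.outputs:
            self.producer[name] = lineage

    def write(self, name: str, data: bytes) -> None:
        self.memory[name] = data

    def read(self, name: str) -> bytes:
        if name not in self.memory:
            self._recompute(name)
        return self.memory[name]

    def _recompute(self, name: str) -> None:
        # A KeyError here means the file has no lineage and is truly lost.
        lineage = self.producer[name]
        # Reading the inputs may itself trigger recomputation upstream.
        inputs = {i: self.read(i) for i in lineage.inputs}
        self.memory.update(lineage.program(inputs))


store = LineageStore()
store.write("A", b"abc")
P = Lineage(("A",), ("B",), lambda files: {"B": files["A"].upper()})
store.submit(P)
store.write("B", b"ABC")
del store.memory["B"]   # simulate losing the only copy of B
print(store.read("B"))  # B is rebuilt by re-running P on A
```

Because `read` recurses through `_recompute`, a lost input is rebuilt before the job that needs it, which is how a chain of lost files would be recovered in this model.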

Recomputation-based recovery assumes that input files are immutable
(or versioned, c.f., §9) and that the executions of the jobs are
deterministic. While these assumptions are not true of all
applications, they apply to a large fraction of datacenter workloads
(c.f., §2.1), which are deterministic applications (often in a
high-level language such as SQL where lineage is simple to capture).

3.3  API  Summary 

Tachyon is an append-only file system, similar to HDFS, that
supports standard file operations, such as create, open, read,
write, close, and delete. In addition, Tachyon provides an API to
capture the lineage across different jobs and frameworks. Table 2
lists the lineage API², and Section 6.1 describes this API in detail.

3.4  Lineage  Overhead 

In terms of storage overhead, job binaries represent by far the
largest component of the lineage information. However, according to
Microsoft data [24], a typical data center runs 1,000 jobs daily on
average, and it takes up to 1 TB to store the uncompressed binaries
of all jobs executed over a one-year interval. This overhead is
negligible even for a small data center.

Furthermore, Tachyon can garbage collect the lineage information. In
particular, Tachyon can delete a lineage record after checkpointing
(c.f., §4) its output files. This will dramatically reduce the
overall size of the lineage information. In addition, in production
environments, the same binary program is often executed many times,
e.g., periodic jobs, with different parameters. In this case, only
one copy of the program needs to be stored.

² A user can choose to use Tachyon as a traditional file system if
he/she does not use the lineage API.
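One way to picture both savings is a lineage log that stores each distinct binary once and drops a record when all of its outputs have been checkpointed. The sketch below is an assumption-laden illustration: the `LineageLog` class, keying binaries by SHA-256 digest, and discarding a binary once no record references it are choices made here, not details given in the paper.

```python
import hashlib


class LineageLog:
    """Hypothetical lineage log with binary deduplication and GC."""

    def __init__(self):
        self.binaries = {}         # sha256 hex digest -> program bytes
        self.records = {}          # lineage id -> (digest, params, outputs)
        self.checkpointed = set()  # files known to be on stable storage
        self._next_id = 0

    def add(self, binary, params, outputs):
        digest = hashlib.sha256(binary).hexdigest()
        # The same program run with different parameters is stored once.
        self.binaries.setdefault(digest, binary)
        lineage_id = self._next_id
        self._next_id += 1
        self.records[lineage_id] = (digest, params, frozenset(outputs))
        return lineage_id

    def mark_checkpointed(self, files):
        self.checkpointed.update(files)
        # A record is no longer needed once all its outputs are checkpointed.
        self.records = {
            lineage_id: record
            for lineage_id, record in self.records.items()
            if not (record[2] <= self.checkpointed)
        }
        # Assumption: binaries that no live record references are dropped too.
        live = {record[0] for record in self.records.values()}
        self.binaries = {d: b for d, b in self.binaries.items() if d in live}
```

A periodic job submitted twice with different parameters would occupy two small records but only one binary, and both disappear once the corresponding outputs are checkpointed.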

3.5  Data  Eviction 

Tachyon works best when the workload’s working set fits in memory.
In this context, one natural question is what the eviction policy
should be when the memory fills up. Our answer to this question is
influenced by the following characteristics identified by previous
works [17, 38] for data-intensive applications:

• Access Frequency: File access often follows a Zipf-like
distribution (see [17, Figure 2]).

• Access Temporal Locality: 75% of the re-accesses take place within
6 hours (see [17, Figure 5]).

Based  on  these  characteristics,  we  use  LRU  as  a  default 
policy.  However,  since  LRU  may  not  work  well  in  all 
scenarios,  Tachyon  also  allows  plugging  in  other  eviction 
policies.  Finally,  as  we  describe  in  Section  4,  Tachyon 
stores  all  but  the  largest  files  in  memory.  The  rest  are 
stored  directly  to  the  persistence  layer. 
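A default LRU policy behind a pluggable interface might look like the following minimal sketch. The `LRUPolicy` and `MemoryTier` classes and their method names are hypothetical, and the sketch ignores whether an evicted file has already been checkpointed, which a real system would have to check first.

```python
from collections import OrderedDict


class LRUPolicy:
    """Evicts the least recently used file; other policies can be swapped in."""

    def __init__(self):
        self._order = OrderedDict()  # least recently used first

    def on_access(self, name):
        self._order[name] = None
        self._order.move_to_end(name)

    def on_remove(self, name):
        self._order.pop(name, None)

    def victim(self):
        return next(iter(self._order))


class MemoryTier:
    def __init__(self, capacity_bytes, policy=None):
        self.capacity = capacity_bytes
        self.policy = policy if policy is not None else LRUPolicy()
        self.files = {}
        self.used = 0

    def put(self, name, data):
        if len(data) > self.capacity:
            raise ValueError("file larger than the memory tier")
        if name in self.files:
            self._drop(name)
        # Evict until the new file fits.
        while self.used + len(data) > self.capacity:
            self._drop(self.policy.victim())
        self.files[name] = data
        self.used += len(data)
        self.policy.on_access(name)

    def get(self, name):
        data = self.files[name]
        self.policy.on_access(name)
        return data

    def _drop(self, name):
        self.used -= len(self.files.pop(name))
        self.policy.on_remove(name)
```

Any object exposing `on_access`, `on_remove`, and `victim` can replace `LRUPolicy`, which is one simple way to realize the pluggable policies mentioned above.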

3.6  Master  Fault-Tolerance 

As shown in Figure 1, Tachyon uses a “passive standby” approach to
ensure master fault-tolerance. The master logs every operation
synchronously to the persistence layer. When the master fails, a new
master is selected from the standby nodes. The new master recovers
the state by simply reading the log. Note that since the metadata
size is orders of magnitude smaller than the output data size, the
overhead of storing and replicating it is negligible.
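The log-then-apply pattern and recovery by replay can be sketched as follows. This is an illustration only: the `Master` class, the JSON-lines log format, and the two operation types are assumptions, and an in-memory `io.StringIO` stands in for a durable log on the persistence layer.

```python
import io
import json


class Master:
    """Hypothetical master that logs each operation before applying it."""

    def __init__(self, log):
        self.log = log      # file-like object; assumed durable in practice
        self.metadata = {}  # file name -> size in bytes

    def _apply(self, op):
        if op["type"] == "create":
            self.metadata[op["name"]] = op["size"]
        elif op["type"] == "delete":
            self.metadata.pop(op["name"], None)

    def execute(self, op):
        # Log synchronously first, then change in-memory state.
        # A real system would also force the write to stable storage.
        self.log.write(json.dumps(op) + "\n")
        self.log.flush()
        self._apply(op)

    @classmethod
    def recover(cls, log):
        # A standby rebuilds the state by replaying the log from the start.
        log.seek(0)
        lines = log.read().splitlines()
        master = cls(log)
        for line in lines:
            master._apply(json.loads(line))
        return master


primary = Master(io.StringIO())
primary.execute({"type": "create", "name": "a", "size": 1})
primary.execute({"type": "create", "name": "b", "size": 2})
primary.execute({"type": "delete", "name": "a"})
standby = Master.recover(primary.log)
print(standby.metadata)  # same state as the failed primary
```

Since only metadata operations are logged, the log stays small relative to the data it describes, consistent with the overhead argument above.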

3.7  Handling  Environment  Changes 

One  category  of  problems  Tachyon  must  deal  with  is 
changes  in  the  cluster’s  runtime  environment.  How  can 
we  rely  on  re-executing  binaries  to  recompute  files  if,  for 
example,  the  version  of  the  framework  that  an  application 
depends  on  changes,  or  the  OS  version  changes? 

One observation we make here is that although files’ dependencies
may go back in time forever, checkpointing allows us to place a
bound on how far back we ever have to go to recompute data. Thus,
before an environment change, we can ensure recomputability by
switching the system into a “synchronous mode”, where (a) all
currently unreplicated files are checkpointed and (b) all new data
is saved synchronously. Once the current data is all replicated, the
update can proceed and this mode can be disabled.

For  more  efficient  handling  of  this  case,  it  might  also  be 
interesting  to  capture  a  computation’s  environment  using 
a  VM  image  [25].  We  have,  however,  not  yet  explored 
this  option. 

3.8  Discussion 

Several questions came up repeatedly when we promoted our open
source project:

Question 1: Why not just use computation frameworks, such as Spark,
that already incorporate lineage? Many data pipelines consist of
multiple jobs. The frameworks only know the lineage of tasks within
a job. There is no way to automatically reconstruct the output of a
previous job in case of failures. Worse yet, different jobs in the
same pipeline can be written in different frameworks, which renders
a solution that would extend lineage across multiple jobs in the
same framework useless.

Question 2: Aren’t immutable data and deterministic program
requirements too stringent? As discussed in Section 2.1, existing
cluster frameworks, such as MapReduce, Spark, and Dryad, satisfy
these requirements, and they leverage them to provide fault-recovery
and straggler mitigation.

Question 3: With one copy in memory, how can Tachyon mitigate hot
spots? While Tachyon leverages lineage to avoid data replication, it
uses client-side caching to mitigate hot spots. That is, if a file
is not available on the local machine, it is read from a remote
machine and cached locally in Tachyon.

Question 4: Isn’t Tachyon’s read/write throughput bounded by the
network since a cluster computation application does I/O remotely?
In our targeted workloads (Section 2.1), computation frameworks
schedule tasks based on data locality to minimize remote I/O.

Question 5: Is Tachyon’s lineage API too complicated for average
programmers? Only framework programmers need to understand Tachyon’s
lineage API. Tachyon does not place extra burden on application
programmers. As long as a framework, e.g., Spark, integrates with
Tachyon, applications on top of the framework take advantage of
lineage-based fault-tolerance transparently.

4  Checkpointing 

This section outlines the checkpoint algorithm used by Tachyon to
bound the amount of time it takes to retrieve a file that is lost
due to failures³. By a file we refer to a distributed file, e.g.,
all output of a MapReduce/Spark job. Unlike other frameworks, such
as MapReduce and Spark, whose jobs are relatively short-lived,
Tachyon runs continuously. Thus, the lineage that accumulates can be
substantial, requiring long recomputation time in the absence of
checkpoints. Therefore, checkpointing is crucial for the performance
of Tachyon. Note that long-lived streaming systems, such as Spark
Streaming [47], leverage their knowledge of job semantics to decide
what and when to checkpoint. Tachyon has to checkpoint in the
absence of such detailed semantic knowledge.

³ In this section, we assume recomputation has the same resources as
the original computation. In Section 5, we address the recomputation
resource allocation issue.

The key insight behind our checkpointing approach in Tachyon is that
lineage enables us to asynchronously checkpoint in the background,
without stalling writes, which can proceed at memory speed. This is
unlike other storage systems that do not have lineage information,
e.g., key-value stores, which synchronously checkpoint, returning to
the application that invoked the write only once data has been
persisted to stable storage. Tachyon’s background checkpointing is
done in a low priority process to avoid interference with existing
jobs. For the foreground job to progress at memory speed, its
working set must naturally fit in memory (see Section 3).

An  ideal  checkpointing  algorithm  would  provide  the 
following: 

1.  Bounded  Recomputation  Time.  Lineage  chains  can 
grow  very  long  in  a  long-running  system  like  Tachyon, 
therefore  the  checkpointing  algorithm  should  provide 
a  bound  on  how  long  it  takes  to  recompute  data  in  the 
case  of  failures.  Note  that  bounding  the  recomputation 
time  also  bounds  the  computational  resources  used  for 
recomputations. 

2.  Checkpointing Hot Files. Some files are much more popular than
others. For example, the same file, which represents a small
dimension table in a data warehouse, is repeatedly read by all
mappers to do a map-side join with a fact table [11].

3.  Avoid Checkpointing Temporary Files. Big data workloads generate
a lot of temporary data. According to our contacts at Facebook,
nowadays more than 70% of data is deleted within a day, without even
counting shuffle data. Figure 3a illustrates how long temporary data
exists in a cluster at Facebook⁴. An ideal algorithm would avoid
checkpointing much of this data.

We consider the following straw man to motivate our algorithm:
asynchronously checkpoint every file in the order that it is
created. Consider a lineage chain, where file A1 is used to generate
A2, which is used to generate A3, A4, and so on. By the time A6 is
being generated, perhaps only A1 and A2 have been checkpointed to
stable storage. If a failure occurs, then A3 through A6 have to be
recomputed. The longer the chain, the longer the recomputation time.
Thus, spreading out checkpoints throughout the chain would make
recomputations faster.

⁴ The workload was collected from a 3,000 machine MapReduce cluster
at Facebook, during a week in October 2010.

4.1  Edge  Algorithm 

Based  on  the  above  characteristics,  we  have  designed 
a  simple  algorithm,  called  Edge,  which  builds  on  three 
ideas.  First,  Edge  checkpoints  the  edge  (leaves)  of  the 
lineage  graph  (hence  the  name).  Second,  it  incorporates 
priorities,  favoring  checkpointing  high-priority  files  over 
low-priority  ones.  Finally,  the  algorithm  only  caches 
datasets  that  can  fit  in  memory  to  avoid  synchronous 
checkpointing,  which  would  slow  down  writes  to  disk 
speed.  We  discuss  each  of  these  ideas  in  detail: 

Checkpointing  Leaves.  The  Edge  algorithm  models 
the  relationship  of  files  with  a  DAG,  where  the  vertices 
are  files,  and  there  is  an  edge  from  a  file  A  to  a  file  B 
if  B  was  generated  by  a  job  that  read  A.  The  algorithm 
checkpoints  the  latest  data  by  checkpointing  the  leaves  of 
the  DAG.  This  lets  us  satisfy  the  requirement  of  bounded 
recovery  time  (explained  in  Section  4.2). 

Figure 4 illustrates how the Edge algorithm works. At the beginning,
there are only two jobs running in the cluster, generating files A1
and B1. The algorithm checkpoints both of them. After they have been
checkpointed, files A3, B4, B5, and B6 become leaves. After
checkpointing these, files A6 and B9 become leaves.

To see the advantage of Edge checkpointing, consider the pipeline
only containing A1 to A6 in the above example. If a failure occurs
when A6 is being checkpointed, Tachyon only needs to recompute from
A4 through A6 to get the final result. As previously mentioned,
checkpointing the earliest files, instead of the edge, would require
a longer recomputation chain.

This type of pipeline is common in industry. For example, continuous
monitoring applications generate hourly reports based on per-minute
reports, daily reports based on hourly reports, and so on.
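The difference between the straw man and leaf checkpointing on the chain A1 to A6 can be checked with a small sketch. The helper names and the dictionary-based DAG encoding are illustrative assumptions; the checkpointed sets are the ones described in the text.

```python
def leaves(children):
    """Files that no other file has been derived from yet."""
    return {f for f, kids in children.items() if not kids}


def files_to_recompute(target, parents, checkpointed):
    """Walk back from `target` until checkpointed ancestors are reached."""
    pending, needed = [target], set()
    while pending:
        f = pending.pop()
        if f in checkpointed or f in needed:
            continue
        needed.add(f)
        pending.extend(parents.get(f, ()))
    return needed


names = [f"A{i}" for i in range(1, 7)]
# children[x] lists files derived from x; parents is the reverse mapping.
children = {a: [b] for a, b in zip(names, names[1:])}
children["A6"] = []
parents = {b: [a] for a, b in zip(names, names[1:])}

print(leaves(children))  # only A6 is a leaf once the chain is complete
# Straw man: checkpoints taken in creation order have reached A1 and A2.
straw_man = files_to_recompute("A6", parents, {"A1", "A2"})
# Edge: A1 and A3 were checkpointed while each was a leaf.
edge = files_to_recompute("A6", parents, {"A1", "A3"})
print(sorted(straw_man), sorted(edge))
```

In this model the straw man must rebuild four files and Edge three; the gap widens as the chain grows, because Edge keeps its latest checkpoint close to the newest file.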

Checkpointing Hot Files. The above idea of checkpointing the latest
data is augmented to first checkpoint high priority files. Tachyon
assigns priorities based on the number of times a file has been
read. Similar to the LFU policy for eviction in caches, this ensures
that frequently accessed files are checkpointed first. This covers
the case when the DAG has a vertex that is repeatedly read leading
to new vertices being created, i.e., a high degree vertex. These
vertices will be assigned a proportionally high priority and will
thus be checkpointed, making recovery fast.

Table 3: File Access Frequency at Yahoo

Figure 3: A 3,000-node MapReduce cluster at Facebook

Edge checkpointing has to balance between checkpointing leaves,
which guarantees recomputation bounds, and checkpointing hot files,
which are important for certain iterative workloads. Here, we
leverage the fact that most big data workloads have a
Zipf-distributed popularity (this has been observed by many
others [11, 17]). Table 3 shows what percentage of the files are
accessed at most a given number of times in a 3,000-node MapReduce
cluster at Yahoo in January 2014. Based on this, we consider a file
high-priority if it has an access count higher than 2, so files that
are accessed more than twice get precedence in checkpointing over
leaves. This threshold can naturally be reconfigured for other
workloads. For this workload, 86% of the checkpointed files are
leaves, whereas the rest are non-leaf files. Hence, in most cases
bounds can be provided.
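Combining the two rules, a checkpoint queue could be ordered as in the sketch below. Only the threshold of 2 comes from the text; the function name, the ordering of hot files by descending read count, and the tie-breaking by file name are assumptions made for this illustration.

```python
HOT_THRESHOLD = 2  # from the text: access count higher than 2 is high-priority


def checkpoint_order(children, read_count, checkpointed):
    """Order uncheckpointed files: hot files first, then leaves."""
    hot, leaf = [], []
    for f, kids in children.items():
        if f in checkpointed:
            continue
        if read_count.get(f, 0) > HOT_THRESHOLD:
            hot.append(f)
        elif not kids:
            leaf.append(f)
        # Cold non-leaf files are skipped; they may be deleted before
        # they ever need a checkpoint.
    hot.sort(key=lambda f: (-read_count[f], f))  # most-read first (assumed)
    leaf.sort()                                  # name order (assumed)
    return hot + leaf


# A dimension table read by three jobs, each of which produced a leaf.
children = {
    "old": ["dim"],
    "dim": ["j1", "j2", "j3"],
    "j1": [], "j2": [], "j3": [],
}
read_count = {"old": 1, "dim": 3}
print(checkpoint_order(children, read_count, checkpointed={"old"}))
```

Here the high-degree vertex `dim` is checkpointed ahead of the three leaves, so a failure does not force every downstream job to rebuild it.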

A replication-based filesystem has to replicate every file, even
temporary data used between jobs. This is because failures could
render such data unavailable. Tachyon avoids checkpointing many of
the temporary files created by frameworks. This is because
checkpointing later data first (leaves) or hot files allows
frameworks or users to delete temporary data before it gets
checkpointed.

Dealing with Large Data Sets. As observed previously, working sets
are Zipf-distributed [17, Figure 2]. We can therefore store in
memory all but the very largest datasets, which we avoid storing in
memory altogether. For example, the distribution of input sizes of
MapReduce jobs at Facebook is heavy-tailed [10, Figure 3a].
Furthermore, 96% of active jobs can have their entire data
simultaneously fit in the corresponding clusters’ memory [10]. The
Tachyon master is thus configured to synchronously write datasets
above the defined threshold to disk. In addition, Figure 3b shows
that file requests in the aforementioned Facebook cluster are highly
bursty. During bursts, Edge checkpointing might checkpoint leaves
that are far apart in the DAG. As soon as the bursts finish, Edge
checkpointing starts checkpointing the rest of the non-leaf files.
Thus, most of the time most of the files in memory have been
checkpointed and can be evicted from memory if room is needed (see
Section 3). If the memory fills with files that have not been
checkpointed, Tachyon checkpoints them synchronously to avoid having
to recompute long lineage chains. In summary, all but the largest
working sets are stored in memory and most data has time to be
checkpointed due to the bursty behavior of frameworks. Thus,
evictions of uncheckpointed files are rare.

Figure 4: Edge Checkpoint Example. Each node represents a file.
Solid nodes denote checkpointed files, while dotted nodes denote
uncheckpointed files.

4.2  Bounded  Recovery  Time 

Checkpointing the edge of the DAG lets us derive a bound
on the recomputation time. The key takeaway of the
bound is that recovery of any file takes on the order of
the time it takes to read or generate an edge. Informally,
it is independent of the depth of the lineage DAG.

Recall that the algorithm repeatedly checkpoints the
edge of the graph. We refer to the time it takes to
checkpoint a particular edge $i$ of the DAG as $W_i$.
Similarly, we refer to the time it takes to generate an
edge $i$ from its ancestors as $G_i$. We now have the
following bound.

Theorem 1 Edge checkpointing ensures that any file can
be recovered in $3 \times M$, for $M = \max_i \{T_i\}$,
$T_i = \max(W_i, G_i)$.

Proof Sketch Consider requesting a file $f$ that had been
fully generated but is no longer available. If $f$ is
checkpointed, it can be read in time $W_f < 3M$, proving
the bound. If $f$ is not checkpointed, then consider the
edge $l$ that was last fully checkpointed before $f$ was
generated. Assume checkpointing of $l$ started at time $t$.
Then at time $t + T_l + M$ the computation had progressed
to the point that $f$ had been fully generated. This is
because otherwise, due to Edge checkpointing, $l$ would not
be the last fully checkpointed edge, but some other edge
that was generated later but before $f$ was generated.
Hence, $l$ can be read in time $T_l \le M$, and in at most
$2M$ further time the rest of the lineage can be computed
until $f$ has been fully generated.

This shows that recomputations are independent of the
“depth” of the DAG. This assumes that the caching
behavior is the same during the recomputation, which is true
when working sets fit in memory (cf. Section 4.1).

The above bound does not apply to priority checkpointing.
However, we can easily incorporate priorities by
alternating between checkpointing the edge a fraction $c$
of the time and checkpointing high-priority data $1 - c$ of
the time.

Corollary 2 Edge checkpointing, where a fraction $c$ of the
time is spent checkpointing the edge, ensures that any file
can be recovered in $\frac{3 \times M}{c}$, for
$M = \max_i \{T_i\}$, $T_i = \max(W_i, G_i)$.

Thus, configuring $c = 0.5$ checkpoints the edge half of
the time, doubling the bound of Theorem 1. These bounds
can be used to provide SLOs to applications.
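
As a worked illustration of the two bounds, the sketch below computes M from per-edge checkpoint and generation times and scales the bound by the edge fraction c. It reads Corollary 2 as 3M/c, which matches the statement that c = 0.5 doubles the bound of Theorem 1. The function name and the sample times are hypothetical.

```python
def recovery_bound(write_times, generate_times, edge_fraction=1.0):
    """Upper bound on recovery time for any file.

    write_times[i]    -- W_i, time to checkpoint edge i
    generate_times[i] -- G_i, time to generate edge i from its ancestors
    edge_fraction     -- c, fraction of time spent checkpointing the edge
                         (c = 1 corresponds to Theorem 1)
    """
    if len(write_times) != len(generate_times) or not write_times:
        raise ValueError("need one W_i and one G_i per edge")
    if not 0.0 < edge_fraction <= 1.0:
        raise ValueError("edge_fraction c must be in (0, 1]")
    # M = max_i T_i, with T_i = max(W_i, G_i)
    m = max(max(w, g) for w, g in zip(write_times, generate_times))
    return 3 * m / edge_fraction
```

For three edges that each take 15 seconds to checkpoint and 8, 12, and 10 seconds to generate, M is 15 seconds, so the bound is 45 seconds with c = 1 and 90 seconds with c = 0.5.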

In practice, priorities can improve the recomputation
cost. In the evaluation (§7), we illustrate the
recomputation times actually observed with Edge
checkpointing.

5  Resource  Allocation 

Although  the  Edge  algorithm  provides  a  bound  on  recom¬ 
putation  cost,  Tachyon  needs  a  resource  allocation  strat¬ 
egy  to  schedule  jobs  to  recompute  data  in  a  timely  man¬ 
ner.  In  addition,  Tachyon  must  respect  existing  resource 
allocation  policies  in  the  cluster,  such  as  fair  sharing  or 
priority. 

In many cases, there will be free resources for
recomputation, because most datacenters are only 30-50%
utilized. However, care must be taken when a cluster is full.
Consider a cluster fully occupied by three jobs, J1, J2,
and J3, with increasing importance (e.g., from research,
testing, and production). There are two lost files, F1 and
F2, requiring recomputation jobs R1 and R2. J2 requests
F2 only. How should Tachyon schedule recomputations?

One  possible  solution  is  to  statically  assign  part  of  the 
cluster  to  Tachyon,  e.g.,  allocate  25%  of  the  resources  on 
the  cluster  for  recomputation.  However,  this  approach 
limits the cluster's utilization when there are no
recomputation jobs. In addition, the problem is complicated
because many factors can impact the design. For example, in
the above case, how should Tachyon adjust R2's priority
if F2 is later requested by the higher priority job J3?

Figure 5: Resource Allocation Strategy for Priority Based
Scheduler.

To  guide  our  design,  we  identify  three  goals: 

1.  Priority  compatibility:  If  jobs  have  priorities,  recom¬ 
putation  jobs  should  follow  them.  For  example,  if  a 
file  is  requested  by  a  low  priority  job,  the  recompu¬ 
tation  should  have  minimal  impact  on  higher  priority 
jobs.  But  if  the  file  is  later  requested  by  a  high  priority 
job,  the  recovery  job’s  importance  should  increase. 

2.  Resource  sharing:  If  there  are  no  recomputation  jobs, 
the  whole  cluster  should  be  used  for  normal  work. 

3.  Avoid cascading recomputation: When a failure occurs,
more than one file may be lost at the same time.
Recomputing them without considering data dependencies
may cause recursive job launching.

We  start  by  presenting  resource  allocation  strategies 
that  meet  the  first  two  goals  for  common  cluster  schedul¬ 
ing  policies.  Then,  we  discuss  how  to  achieve  the  last 
goal,  which  is  orthogonal  to  the  scheduling  policy. 

5.1  Resource  Allocation  Strategy 

The  resource  allocation  strategy  depends  on  the  schedul¬ 
ing  policy  of  the  cluster  Tachyon  runs  on.  We  present 
solutions  for  priority  and  weighted  fair  sharing,  the  most 
common  policies  in  systems  like  Hadoop  and  Dryad 
[45,  27]. 

Priority Based Scheduler In a priority scheduler, using
the same example above, jobs J1, J2, and J3 have
priorities P1, P2, and P3 respectively, where P1 < P2 < P3.

Our solution gives all recomputation jobs the lowest
priority by default, so they have minimal impact on other
jobs. However, this may cause priority inversion. For
example, because file F2's recomputation job R2 has a lower
priority than J2, it is possible that J2 is occupying the
whole cluster when it requests F2. In this case, R2 cannot
get resources, and J2 blocks on it.

We solve this by priority inheritance. When J2 requests
F2, Tachyon increases R2's priority to P2. If F2 is later
read by J3, Tachyon further increases its priority.
Figures 5a and 5b show jobs' priorities before and after J3
requests F2.
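
Priority inheritance as described here can be sketched in a few lines. The RecomputationJob class and its on_request method are hypothetical names, and taking the maximum of the current and the requester's priority is this sketch's reading of "further increases its priority"; it is not taken from Tachyon's code.

```python
class RecomputationJob:
    """A recomputation job that starts at the lowest priority."""

    def __init__(self, file_name, lowest_priority=0):
        self.file_name = file_name
        self.priority = lowest_priority

    def on_request(self, requester_priority):
        """Inherit the requesting job's priority if it is higher.

        The priority never drops, so a later request from a
        lower-priority job leaves it unchanged.
        """
        self.priority = max(self.priority, requester_priority)
        return self.priority
```

In the running example, R2 starts at the lowest priority, moves to P2 when J2 requests F2, and moves to P3 when J3 requests F2 later.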

Fair Sharing Based Scheduler In a hierarchical fair
sharing scheduler, jobs J1, J2, and J3 have shares W1,
W2, and W3 respectively. The minimal share unit is 1.

In our solution, Tachyon has a default weight, WR (as
the minimal share unit 1), shared by all recomputation
jobs. When a failure occurs, all lost files are recomputed
by jobs with an equal share under WR. In our example,
both R1 and R2 are launched immediately with share 1 in
WR.

When a job requires lost data, part of the requesting
job's share (a; see footnote 5) is moved to the
recomputation job. In our example, when J2 requests F2, J2
has share (1 - a) under W2, and R2 has share a under W2.
When J3 requests F2 later, J3 has share (1 - a) under W3
and R2 has share a under W3. When R2 finishes, J2 and J3
resume all of their previous shares, W2 and W3
respectively. Figure 6 illustrates.

This solution fulfills our goals, in particular, priority
compatibility and resource sharing. When no jobs are
requesting a lost file, the maximum share for all
recomputation jobs is bounded. In our example, it is
WR/(W1 + W2 + W3 + WR). When a job requests a missing file,
the share of the corresponding recomputation job is
increased. Since the increased share comes from the
requesting job, there is no performance impact on other
normal jobs.
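
The share arithmetic above can be made concrete with a small sketch. It flattens the hierarchical scheduler into one effective weight per job, which is a simplification made for this example. The function name, the argument layout, and the default values are hypothetical. The parameter alpha stands for the fraction a of the requesting job's share.

```python
def effective_shares(job_weights, recompute_jobs, requests,
                     w_r=1.0, alpha=0.2):
    """Return the effective weight of every job.

    job_weights    -- {job name: weight}, e.g. {"J1": 1, "J2": 2}
    recompute_jobs -- names of running recomputation jobs
    requests       -- (requesting job, recomputation job) pairs
    w_r            -- default weight shared by all recomputation jobs
    alpha          -- fraction of a requester's weight that is moved
    """
    shares = {job: float(w) for job, w in job_weights.items()}
    # Recomputation jobs split the default weight equally. With no
    # recomputation jobs, normal jobs keep the whole cluster.
    if recompute_jobs:
        base = w_r / len(recompute_jobs)
        for r in recompute_jobs:
            shares[r] = base
    # A requester hands a fraction alpha of its own weight to the
    # recomputation job it waits on; other jobs are unaffected.
    for requester, r in requests:
        moved = alpha * job_weights[requester]
        shares[requester] -= moved
        shares[r] += moved
    return shares
```

With weights 1, 2, and 4 for J1, J2, and J3, a default weight of 1, and no requests, recomputation jobs together hold 1/8 of the cluster. Once J2 requests F2 with alpha = 0.25, J2 drops to 1.5 and R2 rises to 1.0, while J1 and J3 keep their shares.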

5.2  Recomputation  Order 

Recomputing  a  file  might  require  recomputing  other  files 
first,  such  as  when  a  node  fails  and  loses  multiple  files 
at  the  same  time.  While  the  programs  could  recursively 
make  callbacks  to  the  workflow  manager  to  recompute 
missing  files,  this  would  have  poor  performance.  For  in¬ 
stance,  if  the  jobs  are  non-preemptable,  computation  slots 
are  occupied,  waiting  for  other  recursively  invoked  files 
to  be  reconstructed.  If  the  jobs  are  preemptable,  computa¬ 
tion  before  loading  lost  data  is  wasted.  For  these  reasons, 
the  workflow  manager  determines  in  advance  the  order  of 
the  files  that  need  to  be  recomputed  and  schedules  them. 

To  determine  the  files  that  need  to  be  recomputed,  the 
workflow  manager  uses  a  logical  directed  acyclic  graph 
(DAG)  for  each  file  that  needs  to  be  reconstructed.  Each 
node  in  the  DAG  represents  a  file.  The  parents  of  a  child 
node  in  the  DAG  denote  the  files  that  the  child  depends 
on.  That  is,  for  a  wide  dependency  a  node  has  an  edge  to 
all  files  it  was  derived  from,  whereas  for  a  narrow  depen¬ 
dency  it  has  a  single  edge  to  the  file  that  it  was  derived 
from.  This  DAG  is  a  subgraph  of  the  DAG  in  Section  4.1. 

To build the graph, the workflow manager does a
depth-first search (DFS) of nodes representing targeted
files. The DFS stops whenever it encounters a node that is
already available in storage. The nodes visited by the DFS
must be recomputed. The nodes that have no lost parents
in the DAG can be recomputed first in parallel. The rest
of the nodes can be recomputed when all of their parents
become available. The workflow manager calls the resource
manager and executes these tasks to ensure the
recomputation of all missing data.

5 a could be a fixed portion of the job's share, e.g., 20%

Figure 6: Resource Allocation Strategy for Fair Sharing Based Scheduler.
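
The ordering procedure can be sketched as a post-order DFS followed by a grouping into waves that may run in parallel. This is an illustration under assumptions: the function names and the dictionary-based graph are hypothetical, and "parents" follows the definition in this section (the files a node depends on).

```python
def recomputation_order(targets, parents, available):
    """Return lost files so that each file follows its lost parents.

    targets   -- files that must be reconstructed
    parents   -- {file: list of files it was derived from}
    available -- set of files still present in storage
    """
    order, visited = [], set()

    def visit(f):
        # The DFS stops at files that are available or already seen.
        if f in available or f in visited:
            return
        visited.add(f)
        for p in parents.get(f, ()):
            visit(p)
        order.append(f)  # post-order: lost parents come first

    for t in targets:
        visit(t)
    return order


def parallel_waves(order, parents):
    """Group an order from recomputation_order into parallel waves."""
    lost = set(order)
    level = {}
    for f in order:
        lost_parents = [p for p in parents.get(f, ()) if p in lost]
        # Files without lost parents land in wave 0.
        level[f] = 1 + max((level[p] for p in lost_parents), default=-1)
    waves = {}
    for f, lvl in level.items():
        waves.setdefault(lvl, []).append(f)
    return [sorted(waves[lvl]) for lvl in sorted(waves)]
```

For a file D derived from B and C, which are both derived from an available file A, the sketch recomputes B and C in the first wave and D in the second.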

6  Implementation 

This  section  describes  the  detailed  information  needed  to 
construct  a  lineage  and  Tachyon’s  integration  with  the 
eco-system. 

6.1  Lineage  Metadata 

Ordered  input  files  list:  Because  files’  names  could  be 
changed,  each  file  is  identified  by  a  unique  immutable  file 
ID  in  the  ordered  list  to  ensure  that  the  application’s  po¬ 
tential  future  recomputations  read  the  same  files  in  the 
same  order  as  its  first  time  execution. 

Ordered  output  files  list:  This  list  shares  the  same  in¬ 
sights  as  the  input  files  list. 

Binary  program  for  recomputation:  Tachyon  launches 
this  program  to  regenerate  files  when  necessary.  There 
are  various  approaches  to  implement  a  file  recomputation 
program.  One  naive  way  is  to  write  a  specific  program  for 
each  application.  However,  this  significantly  burdens  ap¬ 
plication  programmers.  Another  solution  is  to  write  a  sin¬ 
gle  wrapper  program  which  understands  both  Tachyon’s 
lineage  information  and  the  application’s  logic.  Though 
this  may  not  be  feasible  for  all  programs,  it  works  for  ap¬ 
plications  written  in  a  particular  framework.  Each  frame¬ 
work  can  implement  a  wrapper  to  allow  applications  writ¬ 
ten  in  the  framework  to  use  Tachyon  transparently.  There¬ 
fore,  no  burden  will  be  placed  on  application  program¬ 
mers. 

Program configuration: Program configurations can be
dramatically different in various jobs and frameworks. We
address this by having Tachyon forego any attempt to
understand these configurations. Tachyon simply views
them as byte arrays, and leaves the work to program
wrappers to understand. Based on our experience, it is
fairly straightforward for each framework's wrapper
program to understand its own configuration. For example,
in Hadoop, configurations are kept in HadoopConf, while
Spark stores these in SparkEnv. Therefore, their wrapper
programs can serialize them into byte arrays during
lineage submission, and deserialize them during
recomputation.

Dependency type: We use wide and narrow dependencies
for efficient recovery (cf. §5). Narrow dependencies
represent  programs  that  do  operations,  e.g.,  filter  and  map, 
where  each  output  file  only  requires  one  input  file.  Wide 
dependencies  represent  programs  that  do  operations,  e.g., 
shuffle  and  join,  where  each  output  file  requires  more  than 
one  input  file.  This  works  similarly  to  Spark  [46]. 

When  a  program  written  in  a  framework  runs,  before  it 
writes  files,  it  provides  the  aforementioned  information  to 
Tachyon.  Then,  when  the  program  writes  files,  Tachyon 
recognizes  the  files  contained  in  the  lineage.  Therefore, 
the  program  can  write  files  to  memory  only,  and  Tachyon 
relies on the lineage to achieve fault tolerance. If any file
gets lost and needs to be recomputed, Tachyon launches
the binary program stored in the corresponding lineage
instance (a framework wrapper that invokes the user
application's logic) and passes it the lineage information
and the list of lost files so that it can regenerate the
data.
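
The metadata fields listed in this section can be pictured as a single record. The sketch below is hypothetical: the class, field, and function names are illustrative only, and pickle stands in for whatever serialization a framework wrapper would actually use. It is not Tachyon's API.

```python
import pickle
from dataclasses import dataclass
from enum import Enum


class DependencyType(Enum):
    NARROW = "narrow"  # each output file needs one input file
    WIDE = "wide"      # each output file needs several input files


@dataclass(frozen=True)
class LineageRecord:
    input_file_ids: tuple    # ordered, immutable file IDs
    output_file_ids: tuple   # ordered, immutable file IDs
    binary_path: str         # program launched for recomputation
    config: bytes            # opaque to the storage layer
    dependency: DependencyType


def make_record(inputs, outputs, binary_path, framework_config, dependency):
    """Build a record; the wrapper serializes its own configuration."""
    return LineageRecord(tuple(inputs), tuple(outputs), binary_path,
                         pickle.dumps(framework_config), dependency)


def files_to_regenerate(record, lost_ids):
    """Outputs of this record that were lost, in their original order."""
    lost = set(lost_ids)
    return [f for f in record.output_file_ids if f in lost]
```

The storage layer only stores and returns the config bytes; the wrapper that created them is the one that deserializes them during recomputation.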

6.2  Integration  with  the  eco-system 

We  have  implemented  patches  for  existing  frameworks 
to  work  with  Tachyon:  300  Lines-of-Code  (LoC)  for 
Spark  [46]  and  200  LoC  for  MapReduce  [3].  In  addi¬ 
tion,  in  case  of  a  failure,  recomputation  can  be  done  at 
file  level.  For  example,  if  a  MapReduce  job  produces  10 
files  and  if  only  one  file  gets  lost,  Tachyon  can  launch  the 
corresponding  job  to  only  recompute  the  single  lost  file. 
Applications on top of integrated frameworks take
advantage of the lineage transparently, and application
programmers do not need to know the lineage concept.

Figure 7: Tachyon and MemHDFS throughput comparison.
On average, Tachyon outperforms MemHDFS 110x
for write throughput, and 2x for read throughput.

7  Evaluation 

We  evaluated  Tachyon  through  a  series  of  raw  bench¬ 
marks  and  experiments  based  on  real-world  workloads. 

Unless  otherwise  noted,  our  experiments  ran  on  an 
Amazon  EC2  cluster  with  10  Gbps  Ethernet.  Each  node 
had  32  cores,  244GB  RAM,  and  240GB  of  SSD.  We  used 
the  latest  versions  of  Hadoop  (2.3.0)  and  Spark  (0.9). 

We  compare  Tachyon  with  an  in-memory  installa¬ 
tion  of  Hadoop’s  HDFS  (over  RAMFS),  which  we  dub 
MemHDFS.  MemHDFS  still  replicates  data  across  the 
network  for  writes  but  eliminates  the  slowdown  from  disk. 
In  summary,  our  results  show  the  following: 

•  Tachyon can write data 110x faster than MemHDFS.

•  Tachyon speeds up a realistic multi-job workflow by
4x over MemHDFS. In case of failure, it recovers in
around one minute and still finishes 3.8x faster.

•  The  Edge  algorithm  outperforms  any  fixed  check¬ 
pointing  interval. 

•  Recomputation  would  consume  less  than  1.6%  of 
cluster  resources  in  traces  from  Facebook  and  Bing. 

•  Analysis shows that Tachyon can reduce
replication-caused network traffic by up to 50%.

•  Tachyon  helps  existing  in-memory  frameworks  like 
Spark  improve  latency  by  moving  storage  off-heap. 

•  Tachyon  recovers  from  master  failure  within  1  second. 

7.1  Raw  Performance 

We first compare Tachyon's write and read throughputs
with  MemHDFS.  In  each  experiment,  we  ran  32  processes 
on  each  cluster  node  to  write/read  1GB  each,  equivalent 
to  32GB  per  node.  Both  Tachyon  and  MemHDFS  scaled 
linearly  with  number  of  nodes.  Figure  7  shows  our  results. 

For writes, Tachyon achieves 15GB/sec/node. Despite
using 10Gbps Ethernet, MemHDFS write throughput is
0.14GB/sec/node, with a network bottleneck due to 3-way
replication for fault tolerance. We also show the
theoretical maximum performance for replication on this
hardware: using only two copies of each block, the limit
is 0.5GB/sec/node. On average, Tachyon outperforms
MemHDFS by 110x, and the theoretical replication-based
write limit by 30x.

Figure 8: Performance comparison for realistic workflow.
Each number is the average of three runs. The workflow
ran 4x faster on Tachyon than on MemHDFS. In case of
node failure, applications recover in Tachyon in around
one minute and still finish 3.8x faster.

For reads, Tachyon achieves 38GB/sec/node. We optimized
HDFS read performance using two of its most recent
features, HDFS caching and short-circuit reads. With
these features, MemHDFS achieves 17GB/sec/node. The
reason Tachyon performs better is that the HDFS API still
requires an extra memory copy due to Java I/O streams.

Note that Tachyon's read throughput was higher than its
write throughput. This happens simply because memory
hardware is generally optimized to leave more bandwidth
for reads.

7.2  Realistic  Workflow 

In  this  experiment,  we  test  how  Tachyon  performs  with 
a  realistic  workload.  The  workflow  is  modeled  after  jobs 
run  at  a  leading  video  analytics  company  during  one  hour. 
It  contains  periodic  extract,  transform  and  load  (ETL)  and 
metric reporting jobs. Many companies run similar
workflows.

The  experiments  ran  on  a  30-node  EC2  cluster.  The 
whole  workflow  contains  240  jobs  in  20  batches  (8  Spark 
jobs  and  4  MapReduce  jobs  per  batch).  Each  batch  of 
jobs  read  1  TB  and  produced  500  GB.  We  used  the  Spark 
Grep  job  to  emulate  ETL  applications,  and  MapReduce 
Word Count to emulate metric analysis applications. For
each batch of jobs, we ran two Grep applications to pre-process
the  data.  Then  we  ran  Word  Count  to  read  the  cleaned 
data  and  compute  the  final  results.  After  getting  the  final 
results,  the  cleaned  data  was  deleted. 

We  measured  the  end-to-end  latency  of  the  workflow 
running  on  Tachyon  or  MemHDFS.  To  simulate  the  real 
scenario,  we  started  the  workload  as  soon  as  raw  data 
had  been  written  to  the  system,  in  both  Tachyon  and 
MemHDFS  tests.  For  the  Tachyon  setting,  we  also  mea¬ 
sured  how  long  the  workflow  took  with  a  node  failure. 

Figure 9: Edge and fixed interval checkpoint recovery
performance comparison.

Figure 10: Theoretically, recomputation consumes up to
30% of a cluster's resource in the worst case.


Figure 8 shows the workflow's performance on
Tachyon and MemHDFS. The pipeline ran in 16.6 minutes
on Tachyon and 67 minutes on MemHDFS. The speedup
is around 4x. When a failure happened in Tachyon, the
workflow took 1 more minute, still finishing 3.8x faster
than MemHDFS.

With Tachyon, the main overhead was serialization and
deserialization since we used the Hadoop TextInputFormat.
With a more efficient serialization format, the
performance gap would be larger.

7.3  Edge  Checkpointing  Algorithm 

We  evaluate  the  Edge  algorithm  by  comparing  it  with 
fixed-interval  checkpointing.  We  simulate  an  iterative 
workflow  with  100  jobs,  whose  execution  time  follows  a 
Gaussian  distribution  with  a  mean  of  10  seconds  per  job. 
The  output  of  each  job  in  the  workflow  requires  a  fixed 
time  of  15  seconds  to  checkpoint.  During  the  workflow, 
one  node  fails  at  a  random  time. 

Figure 9 compares the average recovery time of this
workflow under Edge checkpointing with various fixed
checkpoint intervals. We see that Edge always outperforms
any fixed checkpoint interval. When too small an
interval is picked, checkpointing cannot keep up with
program progress and starts lagging behind (see footnote 6).
If the interval is too large, then the recovery time will
suffer as the last checkpoint is too far back in time.
Furthermore, even if an optimal average checkpoint interval
is picked, it can perform worse than the Edge algorithm,
which inherently varies its interval to always match the
progress of the computation and can take into account the
fact that different jobs in our workflow take different
amounts of time.

We also simulated other variations of this workload,
e.g., more than one lineage chain or different average job
execution times at different phases in one chain. These
simulations have a similar result, with the gap between
the Edge algorithm and the best fixed interval being larger
in more variable workloads.

6 That is, the system is still busy checkpointing data from far in the
past when a failure happens later in the lineage graph.

7.4  Impact  of  Recomputation  on  Other 
Jobs 

In this experiment, we show that recomputing lost data
does not noticeably impact other users' jobs that do not
depend on the lost data. The experiment has two users,
each running a Spark ETL pipeline. We ran the test three
times, and report the average. Without a node failure, both
users' pipelines executed in 85 seconds on average
(standard deviation: 3s). With a failure, the unimpacted
user's execution time was 86s (std. dev. 3.5s) and the
impacted user's time was 114s (std. dev. 5.5s).

7.5  Recomputation  Resource  Consumption 

Since  Tachyon  relies  on  lineage  information  to  recompute 
missing  data,  it  is  critical  to  know  how  many  resources 
will  be  spent  on  recomputation,  given  that  failures  hap¬ 
pen  every  day  in  large  clusters.  In  this  section,  we  calcu¬ 
late  the  amount  of  resources  spent  recovering  using  both  a 
mathematical  model  and  traces  from  Facebook  and  Bing. 
We  make  our  analysis  using  the  following  assumptions: 

•  Mean  time  to  failure  (MTTF)  for  each  machine  is  3 
years.  If  a  cluster  contains  1000  nodes,  on  average, 
there  is  one  node  failure  per  day. 

•  Sustainable  checkpoint  throughput  is  200MB/s/node. 

•  Resource  consumption  is  measured  in  machine-hours. 

•  In this analysis, we assume Tachyon only uses
coarse-grained recomputation at the job level to compute
the worst case, even though it supports fine-grained
recomputation at the task level.

Worst-case analysis In the worst case, when a node
fails, its memory contains only un-checkpointed data.
This requires tasks that generate output faster than
200MB/sec: otherwise, data can be checkpointed in time.
If a machine has 128GB memory, it requires 655 seconds
(128GB / 200MB/sec) to recompute the lost data. Even
if this data is recovered serially, and all of the other
machines are blocked waiting on the data during this
process (e.g., they were running a highly parallel job that
depended on it), recomputation takes 0.7% (655 seconds /
24 hours) of the cluster's running time on a 1000-node
cluster (with one failure per day). This cost scales
linearly with the cluster size and memory size, as shown in
Figure 10. For a cluster with 5000 nodes, each with 1TB
memory, the upper bound on recomputation cost is 30%
of the cluster resources, which is still small compared to
the typical speedup from Tachyon.

Figure 11: Using the trace from Facebook and Bing,
recomputation consumes up to 0.9% and 1.6% of the resource
in the worst case respectively.
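
The arithmetic behind these worst-case figures can be reproduced with a short sketch. The function names are hypothetical. The sketch assumes 1GB = 1024MB, which is what yields 655 seconds, and it uses the paper's rounded failure rate of one failure per day per 1000 nodes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def recompute_seconds(memory_gb, checkpoint_mb_per_s=200.0):
    """Time to regenerate a full node's memory at checkpoint speed."""
    return memory_gb * 1024 / checkpoint_mb_per_s


def worst_case_fraction(nodes, memory_gb, nodes_per_daily_failure=1000,
                        checkpoint_mb_per_s=200.0):
    """Fraction of cluster time lost if every failure blocks all nodes."""
    failures_per_day = nodes / nodes_per_daily_failure
    blocked = failures_per_day * recompute_seconds(memory_gb,
                                                   checkpoint_mb_per_s)
    return blocked / SECONDS_PER_DAY
```

Under these assumptions a 1000-node cluster with 128GB per node loses about 0.76% of its time, which the text reports as 0.7%, and a 5000-node cluster with 1TB per node loses about 30%.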

Real  traces  In  real  workloads,  the  recomputation  cost  is 
much  lower  than  in  the  worst-case  setting  above,  because 
individual  jobs  rarely  consume  the  entire  cluster,  so  a  node 
failure  may  not  block  all  other  nodes.  (Another  reason  is 
that  data  blocks  on  a  failed  machine  can  often  be  recom¬ 
puted  in  parallel,  but  we  do  not  quantify  this  here.)  Fig¬ 
ure  11  estimates  these  costs  based  on  job  size  traces  from 
Facebook  and  Bing  (from  Table  2  in  [11]),  performing  a 
similar  computation  as  above  with  the  active  job  sizes  in 
these  clusters.  With  the  same  5000-node  cluster,  recom¬ 
putation  consumes  only  up  to  0.9%  and  1.6%  of  resources 
at  Facebook  and  Bing  respectively.  Given  most  clusters 
are  only  30-50%  utilized,  this  overhead  is  negligible. 

7.6  Network  Traffic  Reduction 

Data  replication  from  the  filesystem  consumes  almost 
half  the  cross-rack  traffic  in  data-intensive  clusters  [19]. 
Because  Tachyon  checkpoints  data  asynchronously  some 
time  after  it  was  written,  it  can  avoid  replicating  short¬ 
lived  files  altogether  if  they  are  deleted  before  Tachyon 
checkpoints  them,  and  thus  reduce  this  traffic. 

We analyze Tachyon's bandwidth savings via simulations
with the following parameters:

•  Let T be the ratio between the time it takes to
checkpoint a job's output and the time to execute it. This
depends on how IO-bound the application is. For
example, we measured a Spark Grep program using
Hadoop Text Input format, which resulted in T =
4.5, i.e., the job runs 4.5x faster than replicating data
across the network. With a more efficient binary format,
T will be larger.

•  Let  X  be  the  percent  of  jobs  that  output  permanent 
data.  For  example,  60%  (X  =  60)  of  generated  data 
got  deleted  within  16  minutes  at  Facebook  (Fig.  3a). 

•  Let  Y  be  the  percentage  of  jobs  that  read  output  of 
previous  jobs.  If  Y  is  100,  the  lineage  is  a  chain.  If  Y 
is 0, the depth of the lineage is 1. At a leading Internet
messaging  company,  Y  is  84%. 

Based  on  this  information,  we  set  X  as  60  and  Y  as 
84.  We  simulated  1000  jobs  using  Edge  checkpointing. 
Depending  on  T,  the  percent  of  network  traffic  saved  over 
replication  ranges  from  40%  at  T  =  4  to  50%  at  T  >  10. 

7.7  Overhead  in  Single  Job 

When  running  a  single  job  instead  of  a  pipeline,  we  found 
that  Tachyon  imposes  minimal  overhead,  and  can  improve 
performance  over  current  in-memory  frameworks  by  re¬ 
ducing  garbage  collection  overheads.  We  use  Spark  as 
an  example,  running  a  Word  Count  job  on  one  worker 
node.  Spark  can  natively  cache  data  either  as  deserial¬ 
ized  Java  objects  or  as  serialized  byte  arrays,  which  are 
more  compact  but  create  more  processing  overhead.  We 
compare  these  modes  with  caching  in  Tachyon.  For  small 
data  sizes,  execution  times  are  similar.  When  the  data 
grows,  however,  Tachyon  storage  is  faster  than  Spark’s  na¬ 
tive  modes  because  it  avoids  Java  memory  management.7 
These  results  show  that  Tachyon  can  be  a  drop-in  alterna¬ 
tive  for  current  in-memory  frameworks. 


7  Although  Tachyon  is  written  in  Java,  it  stores  data  in  a  Linux 
RAMFS. 

7.8  Master  Fault  Tolerance

Tachyon  utilizes  hot  failovers  to  achieve  fast  master  re¬ 
covery.  We  tested  recovery  for  an  instance  with  1  to  5 
million  files,  and  found  that  the  failover  node  resumed 
the  master’s  role  after  acquiring  leadership  within  0.5  sec¬ 
onds,  with  a  standard  deviation  of  0.1  second.  This  perfor¬ 
mance  is  possible  because  the  failover  constantly  updates 
its  file  metadata  based  on  the  log  of  the  current  master. 

8  Related  Work 

Storage  Systems  Distributed  file  systems  [14,  39,  42], 
e.g.,  GFS  [23]  and  FDS  [13],  and  key/value  stores  [1,  12, 
22],  e.g.,  RAMCloud  [34]  and  HBase  [4],  replicate  data  to 
different  nodes  for  fault-tolerance.  Their  write  through¬ 
put  is  bottlenecked  by  network  bandwidth.  FDS  uses  a 
fast  network  to  achieve  higher  throughput.  Despite  the 
higher  cost  of  building  FDS,  its  throughput  is  still  far  from 
memory  throughput.  Our  key  contribution  with  respect  to 
this  work  is  leveraging  the  lineage  concept  in  the  storage 
layer  to  eschew  replication  and  instead  store  a  single  in¬ 
memory  copy  of  files. 

Computation  Frameworks  Spark  [46]  uses  lineage  in¬ 
formation  within  a  single  job  or  shell,  all  running  inside 
a  single  JVM.  Different  queries  in  Spark  cannot  share 
datasets  (RDD)  in  a  reliable  and  high-throughput  fashion, 
because  Spark  is  a  computation  engine,  rather  than  a  stor¬ 
age  system.  Our  integration  with  Spark  substantially  im¬ 
proves  existing  industry  workflows  of  Spark  jobs,  as  they 
can  share  datasets  reliably  through  Tachyon.  Moreover, 
Spark  can  benefit  from  the  asynchronous  checkpointing 
in  Tachyon,  which  enables  high-throughput  writes. 

Other  frameworks,  such  as  MapReduce  [20]  and 
Dryad  [26],  also  trace  task  lineage  within  a  job.  However, 
as  execution  engines,  they  do  not  trace  relations  among 
files,  and  therefore  cannot  provide  high-throughput  data 
sharing  among  different  jobs.  Like  Spark,  they  can  also 
integrate  with  Tachyon  to  improve  the  efficiency  of  data 
sharing  among  different  jobs  or  frameworks. 

Caching  Systems  Like  Tachyon,  Nectar  [24]  also  uses  the 
concept  of  lineage,  but  it  does  so  only  for  a  specific  pro¬ 
gramming  framework  (DryadLINQ  [44]),  and  in  the  con¬ 
text  of  a  traditional,  replicated  file  system.  Nectar  is  a  data 
reuse  system  for  DryadLINQ  queries  whose  goals  are  to 
save  space  and  to  avoid  redundant  computations.  The  for¬ 
mer  goal  is  achieved  by  deleting  largely  unused  files  and 
rerunning  the  jobs  that  created  them  when  needed.  How¬ 
ever,  no  time  bound  is  provided  to  retrieve  deleted  data. 
The  latter  goal  is  achieved  by  identifying  pieces  of  code 
that  are  common  in  different  programs  and  reusing  previously 
computed  files.  Nectar  achieves  this  by  relying  heavily  on  the 
SQL-like  DryadLINQ  query  semantics:  in  particular,  it  needs  to 
analyze  LINQ  code  to  determine  when  results  may  be  reused. 
Nectar  also  stores  data  in  a  replicated  on-disk  file  system  rather 
than  attempting  to  speed  up  data  access.  In  contrast,  Tachyon's 
goal  is  to  provide  data  sharing  across  different  frameworks  with 
memory  speed  and  bounded  recovery  time. 

Lineage  Based  Storage  Systems  and  Databases  Previ¬ 
ous  file  systems  [32]  and  databases  [18]  also  use  lineage 
information,  which  is  called  provenance  in  their  contexts. 
Unlike  Tachyon,  their  goals  are  to  provide  data  security, 
verification,  etc.  Tachyon  is  the  first  system  to  push  lineage 
into  the  storage  layer  to  improve  performance,  which  entails 
a  different  set  of  challenges. 

Checkpoint  Research  Checkpointing  has  been  a  rich  re¬ 
search  area.  Much  of  the  research  was  on  using  check¬ 
points  to  minimize  the  re-execution  cost  when  failures 
happen  during  long  jobs.  For  instance,  much  focus  was  on 
optimal  checkpoint  intervals  [41,  43],  as  well  as  reducing 
the  per-checkpoint  overhead  [21,  35,  36].  Unlike  previous 
work,  which  uses  synchronous  checkpoints,  Tachyon  does 
checkpointing  asynchronously  in  the  background,  which 
is  enabled  by  using  lineage  information  to  recompute  any 
missing  data  if  a  checkpoint  fails  to  finish. 
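
The interplay between asynchronous checkpointing and lineage can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: the class `LineageStore` and its methods are hypothetical, checkpoints are taken in simple FIFO order rather than by Tachyon's checkpointing algorithm, and lineage metadata is assumed to survive the failure. The point is only that a write can return before its checkpoint finishes, because any file lost in the meantime can be recomputed.

```python
class LineageStore:
    def __init__(self):
        self.memory = {}       # file -> data, lost on failure
        self.persisted = {}    # file -> data, survives failure
        self.lineage = {}      # file -> (function, input file names), assumed durable
        self.pending = []      # files waiting for a background checkpoint
        self.recomputed = []   # files rebuilt from lineage (for inspection)

    def write(self, name, fn, inputs=()):
        """Compute and store a file in memory; do not wait for the checkpoint."""
        data = fn(*[self.read(i) for i in inputs])
        self.memory[name] = data
        self.lineage[name] = (fn, tuple(inputs))
        self.pending.append(name)
        return data

    def checkpoint_one(self):
        """One step of the background checkpointer (FIFO order is an assumption)."""
        if self.pending:
            name = self.pending.pop(0)
            if name in self.memory:
                self.persisted[name] = self.memory[name]

    def fail(self):
        """Simulate losing all in-memory data and unfinished checkpoints."""
        self.memory.clear()
        self.pending.clear()

    def read(self, name):
        if name in self.memory:
            return self.memory[name]
        if name in self.persisted:
            self.memory[name] = self.persisted[name]
            return self.memory[name]
        # Neither in memory nor checkpointed: recompute from lineage.
        fn, inputs = self.lineage[name]
        data = fn(*[self.read(i) for i in inputs])
        self.memory[name] = data
        self.recomputed.append(name)
        return data


store = LineageStore()
store.write("raw", lambda: [1, 2, 3])
store.write("doubled", lambda xs: [2 * x for x in xs], ["raw"])
store.checkpoint_one()   # only "raw" is checkpointed before the failure
store.fail()
result = store.read("doubled")
```

Here "raw" is restored from its checkpoint while "doubled", whose checkpoint never ran, is recomputed from its recorded lineage.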

9  Limitations  and  Future  Work 

Tachyon  aims  to  improve  the  performance  for  its  targeted 
workloads  (§2.1),  and  the  evaluations  show  promising  results. 
Although  many  big  data  clusters  are  running  our  targeted 
workloads,  we  realize  that  there  are  cases  in  which 
Tachyon  provides  limited  improvement,  e.g.,  CPU  or  net¬ 
work  intensive  jobs.  In  addition,  there  are  also  challenges 
that  future  work  needs  to  address: 

Mutable  data:  This  is  challenging  as  lineage  cannot  gen¬ 
erally  be  efficiently  stored  for  fine-grained  random-access 
updates.  However,  there  are  several  directions,  such  as 
exploiting  deterministic  updates  and  batch  updates. 
Multi-tenancy:  Memory  fair  sharing  is  an  important  re¬ 
search  direction  for  Tachyon.  Policies  like  LRU/LFU 
might  provide  good  overall  performance  at  the  expense 
of  providing  isolation  guarantees  to  individual  users. 
Hierarchical  storage:  Though  memory  capacity  grows 
exponentially  each  year,  it  is  still  expensive  compared  to  its 
alternatives.  One  early  adopter  of  Tachyon  suggested  that, 
besides  utilizing  the  memory  layer,  Tachyon 
should  also  leverage  NVRAM  and  SSDs.  In  the  future, 
we  will  investigate  how  to  support  hierarchical  storage  in 
Tachyon. 
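
The multi-tenancy concern listed above, that a global LRU policy gives no isolation between users, can be illustrated with a toy simulation. Everything here is an assumption made for illustration (the `LRUCache` class, the cache capacity, and the two access patterns); it does not model Tachyon's eviction code.

```python
from collections import OrderedDict


class LRUCache:
    """A single LRU cache shared by all users, with per-user hit/miss counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # (user, block id) -> True, oldest first
        self.hits = {}
        self.misses = {}

    def access(self, user, block):
        key = (user, block)
        if key in self.blocks:
            self.blocks.move_to_end(key)          # mark as most recently used
            self.hits[user] = self.hits.get(user, 0) + 1
        else:
            self.misses[user] = self.misses.get(user, 0) + 1
            self.blocks[key] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)   # evict least recently used


# User A alone: a 2-block working set fits easily in a 4-block cache.
alone = LRUCache(capacity=4)
for r in range(10):
    for b in range(2):
        alone.access("A", b)

# Shared cache: user B scans 10 new blocks between each of A's passes.
shared = LRUCache(capacity=4)
for r in range(10):
    for b in range(2):
        shared.access("A", b)
    for b in range(r * 10, r * 10 + 10):
        shared.access("B", b)

a_hits_alone = alone.hits.get("A", 0)
a_hits_shared = shared.hits.get("A", 0)
```

In this toy setup, user A hits the cache on every access after the first pass when running alone, but never hits once user B's scan shares the cache, even though A's working set is small. A fair-sharing policy would aim to protect A's blocks from B's scan.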

10  Conclusion 

As  ever  more  datacenter  workloads  start  to  run  in  memory, 
write  throughput  becomes  a  major  bottleneck  for  applications. 
Therefore,  we  believe  that  lineage-based  recovery 
might  be  the  only  way  to  speed  up  cluster  storage  systems 
to  achieve  memory  throughput.  We  propose  Tachyon,  a 
storage  system  that  incorporates  lineage  to  speed  up  the 
significant  part  of  the  workload  consisting  of  determinis¬ 
tic  batch  jobs.  We  identify  and  address  some  of  the  key 
challenges  in  making  Tachyon  practical.  Our  evaluations 
show  that  Tachyon  provides  promising  speedups  over  ex¬ 
isting  storage  alternatives.  Tachyon  is  open  source  with 
contributions  from  more  than  40  individuals  and  over  10 
companies. 